Measuring the impact of character recognition errors on downstream text analysis

Author

  • Daniel P. Lopresti
Abstract

Noise presents a serious challenge in optical character recognition, as well as in the downstream applications that take its output as input. In this paper, we describe a paradigm for measuring the impact of recognition errors on the stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. Using a hierarchical methodology based on approximate string matching to classify errors, we isolate and analyze their cascading effects as they travel through the pipeline. We present experimental results based on injecting single errors into a large corpus of test documents to study how their impact varies with the nature of the error and the character(s) involved. While most such errors are found to be localized, in the worst case some have an amplifying effect that extends well beyond the site of the original error, thereby degrading the performance of the end-to-end system.
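The single-error injection idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the function names are hypothetical, the tokenizer is a naive whitespace split, and Python's `difflib.SequenceMatcher` stands in for the approximate string matching the abstract mentions. It only shows how one character substitution can stay local to a token or spill over into its neighbor.

```python
from difflib import SequenceMatcher


def inject_substitution(text: str, position: int, replacement: str) -> str:
    """Return a copy of `text` with the character at `position` replaced.

    Hypothetical helper simulating a single recognition error.
    """
    if not 0 <= position < len(text):
        raise IndexError("position out of range")
    return text[:position] + replacement + text[position + 1:]


def tokenize(text: str) -> list[str]:
    """Naive whitespace tokenizer (an assumed stand-in for a real one)."""
    return text.split()


def count_affected_tokens(clean: str, noisy: str) -> int:
    """Count tokens covered by non-matching spans of the token alignment.

    For each non-equal span, the larger of the two sides is counted.
    """
    clean_tokens = tokenize(clean)
    noisy_tokens = tokenize(noisy)
    matcher = SequenceMatcher(a=clean_tokens, b=noisy_tokens, autojunk=False)
    affected = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            affected += max(i2 - i1, j2 - j1)
    return affected


if __name__ == "__main__":
    clean = "The cat sat on the mat."
    # Error inside a word: only that token changes.
    local = inject_substitution(clean, 4, "e")
    # Error that destroys a space: two tokens merge into one.
    merged = inject_substitution(clean, 3, "l")
    print(local, count_affected_tokens(clean, local))
    print(merged, count_affected_tokens(clean, merged))
```

Replacing a letter inside a word disturbs one token, while replacing a space merges two tokens, which is a simple instance of an error whose effect extends beyond the character where it occurred.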


Similar resources

Impact of Measuring Devices and Data Analysis on the Determination of Gas Membrane Properties

The time-lag method, using a gas permeation experiment, is currently the most popular method for determining the membrane properties: diffusivity coefficient and permeability coefficient, and from which the solubility coefficient can be calculated. In this investigation, the impact of systematic, random (noise), resolution and extrapolation errors associated with gas permeatio...


A Technical Report: Entity Extraction using Both Character-based and Token-based Similarity

Entity extraction is fundamental to many text mining tasks such as organisation name recognition. A popular approach to entity extraction is based on matching sub-string candidates in a document against a dictionary of entities. To handle spelling errors and name variations of entities, usually the matching is approximate and edit or Jaccard distance is used to measure dissimilarity between sub...


Transcript mapping for handwritten Chinese documents by integrating character recognition model and geometric context

Creating document image datasets with ground-truths of regions, text lines and characters is a prerequisite for document analysis research. However, ground-truthing large datasets is not only laborious and time consuming but also prone to errors due to the difficulty of character segmentation and the large variability of character shape, size and position. This paper describes an effective reco...


Neural Network Based Recognition System Integrating Feature Extraction and Classification for English Handwritten

Handwriting recognition has been one of the active and challenging research areas in the field of image processing and pattern recognition. It has numerous applications, including reading aids for the blind, bank cheques, and conversion of any handwritten document into structured text form. Neural Network (NN) with its inherent learning ability offers promising solutions for handwritten characte...


Impact of imperfect OCR on part-of-speech tagging

Part-of-speech (POS) tagging is the foundation of natural language processing (NLP) systems, and thus has been an active area of research for many years. However, one question remains unanswered: How will a POS tagger behave when the input text is not error-free? This issue can be of great importance when the text comes from imperfect sources like Optical Character Recognition (OCR). This paper...




Publication date: 2008